Objective

Implement a pipeline that runs as a set of pseudo-distributed MapReduce tasks to plot the mean delay of the five most active airlines and the five most active airports in the country.

Implementation

The entire execution is carried out in two MapReduce stages.

Stage 1: Active airlines and airports

Map:

  • The map function in the first stage reads the input CSV line by line, parses it and counts the flight frequency per airline and per airport.
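
The map step above can be sketched in plain Python, without Hadoop. This is only an illustration: the column positions, the sample rows, and the choice to put the "AL"/"AP" tag at the front of the key are assumptions, not details taken from the project.

```python
import csv
import io

def map_record(line, carrier_col=0, airport_col=1):
    """Emit (tagged key, 1) pairs for one CSV line.

    carrier_col and airport_col are hypothetical column positions.
    """
    fields = next(csv.reader(io.StringIO(line)))
    yield ("AL_" + fields[carrier_col], 1)   # one flight for this airline
    yield ("AP_" + fields[airport_col], 1)   # one flight for this airport

# Made-up sample rows: carrier, airport, year, month
sample = ["UA,ORD,2015,1", "UA,SFO,2015,2", "DL,ORD,2015,1"]
pairs = [kv for line in sample for kv in map_record(line)]
print(pairs)
```

Each input line yields two pairs, so the airline and airport counts can be summed independently later on.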

Partition:

  • In this phase, the partition class routes each key from the mapper to one of two reducers, using the string “AL” or “AP” appended to the key.
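
A minimal sketch of this routing, assuming the tag is the first two characters of the key and that exactly two reducers are configured. The function name `get_partition` is a stand-in chosen for illustration, not the project's partitioner class.

```python
def get_partition(key, num_reducers=2):
    """Route airline keys to reducer 0 and airport keys to reducer 1."""
    tag = key[:2]  # assumed position of the "AL"/"AP" tag
    if tag == "AL":
        return 0 % num_reducers
    if tag == "AP":
        return 1 % num_reducers
    raise ValueError("unexpected key tag: " + key)

print(get_partition("AL_UA"), get_partition("AP_ORD"))
```

Taking the result modulo `num_reducers` keeps the index valid even if the job is run with a single reducer.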

Reduce:

  • The reduce function receives an airline or airport as the key and a list of frequencies as the value.
  • It sums the frequency values to get the total flight count for each airline and airport in the dataset.
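
The reduce step amounts to grouping values by key and summing them. The sketch below simulates the shuffle with a dictionary; the key format and sample pairs are assumptions for illustration.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values emitted for each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, frequencies):
    """Sum the per-record frequencies for one airline or airport."""
    return key, sum(frequencies)

# Made-up mapper output
pairs = [("AL_UA", 1), ("AP_ORD", 1), ("AL_UA", 1), ("AP_SFO", 1), ("AP_ORD", 1)]
totals = dict(reduce_counts(k, v) for k, v in group_by_key(pairs).items())
print(totals)
```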

Stage 2: Average monthly flight delay

Map:

  • In the setup function of the mapper, we read the output of stage 1, sort it and find the top 5 airlines and airports.
  • The map function in the second stage then reads each record, pre-processes it to check that it passes the sanity test, and assigns it to the appropriate airline and/or airport hashmap based on the calculated top 5 airports and airlines.
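
The top-5 selection and the filtering of records can be sketched as follows. This version uses `heapq.nlargest` rather than a full sort, and the counts, key format, and the `keep` helper are hypothetical; the sanity test itself is not shown because its rules are not described here.

```python
import heapq

def top_n(counts, tag, n=5):
    """Return the n keys with the highest count among keys starting with tag."""
    tagged = {k: v for k, v in counts.items() if k.startswith(tag)}
    return set(heapq.nlargest(n, tagged, key=tagged.get))

# Made-up stage 1 totals
stage1_output = {
    "AL_UA": 90, "AL_DL": 80, "AL_AA": 70, "AL_WN": 60, "AL_US": 50, "AL_B6": 5,
    "AP_ORD": 40, "AP_ATL": 45,
}
top_airlines = top_n(stage1_output, "AL")
top_airports = top_n(stage1_output, "AP")

def keep(carrier, airport):
    """A record is relevant if its airline or its airport is in a top-5 set."""
    return ("AL_" + carrier) in top_airlines or ("AP_" + airport) in top_airports

print(sorted(top_airlines), keep("B6", "ORD"), keep("B6", "JFK"))
```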

Partition:

  • In this phase, the partition class partitions each airport or airline by month.
  • For example, for all airlines, the data for January goes to reducer 0 and so on up to reducer 11 for December.
  • Similarly, for all airports, the data for January goes to reducer 12 and so on up to reducer 23 for December.
  • The partitioner determines whether the current record is for an airport or an airline from the first 2 characters of the key, which are either “AP” or “AL”, and reads the month as the 4th element of the key when it is split on “_”, since the key starts with either AL_ or AP_.
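
The reducer assignment described above can be sketched as below. The "_" delimiter and the key layout `TAG_code_year_month` are assumptions made so that the month is the 4th element; only the tag position and the month position come from the description.

```python
def get_partition(key):
    """Map an airline key to reducers 0-11 and an airport key to reducers 12-23."""
    tag = key[:2]                    # "AL" or "AP"
    month = int(key.split("_")[3])   # 4th element of the key, 1 = January
    if not 1 <= month <= 12:
        raise ValueError("month out of range in key: " + key)
    offset = {"AL": 0, "AP": 12}[tag]
    return offset + month - 1

print(get_partition("AL_UA_2015_1"), get_partition("AP_ORD_2015_12"))
```

With this layout the job needs 24 reducers, one per month for airlines and one per month for airports.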

Reduce:

  • The reduce function aggregates the normalized delays and the flight counts for each airport and airline, computes the mean normalized delay, and writes these values to a comma-separated file as output.
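
A minimal sketch of the aggregation, assuming each value reaching the reducer is a (normalized delay sum, flight count) pair. That value shape, the key format, and the two-decimal output are illustrative choices, not the project's actual record format.

```python
def reduce_mean(key, values):
    """Return one CSV line with the mean normalized delay for a key."""
    total_delay = sum(delay for delay, _ in values)
    total_flights = sum(count for _, count in values)
    if total_flights == 0:
        raise ValueError("no flights for key: " + key)
    mean_delay = total_delay / total_flights
    return "{},{:.2f}".format(key, mean_delay)

line = reduce_mean("AL_UA_2015_1", [(10.0, 2), (5.0, 3)])
print(line)
```

Carrying the flight count alongside the delay sum lets partial results be combined before the final division, which a plain average of averages would not allow.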

Observation

  • As we explore the world of airline on-time performance, we come across a number of factors that affect flight delays: airport capacity, weather conditions, and the influence of the day of the week, the time, the distance and the region.
  • So while calculating the mean flight delay for each airport and airline, we should consider these factors to get a better understanding of the trends in, and reasons for, the delays.
  • From the results for the top 5 airlines, we can see no significant downward slope in delay time over the past 28 years.
  • The graphs below suggest that the years 1999-2002 saw a significant increase in delays, particularly considering that the level of airline delays had previously been decreasing.
  • Some seasonal trends can also be observed in the data: during winter there is a high chance of flight delays and cancellations due to weather conditions or high traffic, while the months of April-June show fewer delays and cancellations.

Visualization

Top 5 Airlines

  • The plots below show the normalized delay for each of the top five airlines at its 9 most frequent airports for the past 28 years.
  • The color gradient helps identify the frequency of the flights at the given airports: the darker the color, the higher the frequency.
  • When we select any single airport from the legend on the right, we can see the trend of normalized delays at that airport for the respective airline over the span of 28 years.

The violin plots above give a general idea of the normalized delay distribution for each airline.

Top 5 Airports

  • The plots below show the normalized delay at each of the top five airports for its 9 most frequent airlines over the past 28 years.
  • The color gradient helps identify the frequency of the flights at the given airports: the darker the color, the higher the frequency.
  • When we select any single airline from the legend on the right, we can see the trend of normalized delays by that airline at the respective airport over the span of 28 years.

The plots above give a general idea of the normalized delay distribution for each airport.

AWS Execution Environment Specifications

Local Execution Environment Specifications: